HEAD <<<<<<< HEAD
Population pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia).
In this take-home exercise, I am going to design an age-sex pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
For this task, the data sets entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 are used. These data sets are available at Department of Statistics home page. The data set is available at Department of Statistics home page.
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages = c('ggiraph', 'plotly',
'DT', 'patchwork',
'gganimate', 'tidyverse',
'readxl', 'gifski', 'gapminder')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The code chunk below imports respopagesextod2000to2010.csv and respopagesextod2011to2020.csv into R environment by using read_csv() function of readr package.
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio.
df1 <- read_csv("data/respopagesextod2000to2010.csv")
df2 <- read_csv("data/respopagesextod2011to2020.csv")
tbl_df(df1)
# A tibble: 1,040,592 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Cheng San 0_to_4 Males HDB 1- and 2-Room ~ 20 2000
2 Ang Mo Kio Cheng San 0_to_4 Males HDB 3-Room Flats 480 2000
3 Ang Mo Kio Cheng San 0_to_4 Males HDB 4-Room Flats 220 2000
4 Ang Mo Kio Cheng San 0_to_4 Males HDB 5-Room and Exe~ 80 2000
5 Ang Mo Kio Cheng San 0_to_4 Males HUDC Flats (exclud~ 0 2000
6 Ang Mo Kio Cheng San 0_to_4 Males Landed Properties 0 2000
7 Ang Mo Kio Cheng San 0_to_4 Males Condominiums and O~ 0 2000
8 Ang Mo Kio Cheng San 0_to_4 Males Others 0 2000
9 Ang Mo Kio Cheng San 0_to_4 Females HDB 1- and 2-Room ~ 20 2000
10 Ang Mo Kio Cheng San 0_to_4 Females HDB 3-Room Flats 390 2000
# ... with 1,040,582 more rows
tbl_df(df2)
# A tibble: 984,656 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1~ 0 2011
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3~ 10 2011
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4~ 30 2011
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5~ 50 2011
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC ~ 0 2011
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Lande~ 0 2011
7 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Condo~ 40 2011
8 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Others 0 2011
9 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 1~ 0 2011
10 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 3~ 10 2011
# ... with 984,646 more rows
As we can see from above, the two data frames have the same column names. We will firstly combine the two data frames into one by rows using rbind.
PA SZ AG
Length:2025248 Length:2025248 Length:2025248
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Sex TOD Pop
Length:2025248 Length:2025248 Min. : 0.00
Class :character Class :character 1st Qu.: 0.00
Mode :character Mode :character Median : 0.00
Mean : 38.41
3rd Qu.: 10.00
Max. :3160.00
Time
Min. :2000
1st Qu.:2005
Median :2010
Mean :2010
3rd Qu.:2015
Max. :2020
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
dfcombined <- dfcombined %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
dfcombined <- dfcombined %>%
mutate(Pop1 = case_when(Sex == "Males" ~ -Pop, TRUE ~ Pop))
dfcombined_selected <- dfcombined %>%
select(- SZ, - TOD)
dfgrouped_selected <- dfcombined_selected %>%
group_by(PA, Time, AG, Sex) %>%
summarise(Pop1=sum(Pop1), .groups = 'keep') %>%
ungroup()
A static plot was firstly created using ggplot2 bar chart. There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
Next, transition_time() method of gganimate extension is added on top of the static plot to determine the behavior of the animation. The transition should happen across time. Linear easing is chosen to define the pace of change.
The full code chunk and output are shown below.
ggplot(data=dfgrouped_selected,
aes(x = AG, y = Pop1, fill = Sex)) +
geom_col() +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
labs(title = 'Demographic Structure of Singapore 2000-2020\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
From above animation, we can see that Singapore population have been aging from 2000 to 2020 for both males and females as the Pyramid center moves higher with time passing by. There are more female elderly than male elderly.
In this section, we are going to zoom into more details to see how the demographic structure varies by planning area.
p <- dfgrouped_selected %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p, height = 10000, width = 800, tooltip = "text")
Given the size of number of planning areas, let’s focus our analysis on the top5 populated areas of 2000 and 2020.
data_2000 <- dfcombined_selected %>%
filter(Time == "2000")
data_2000_grouped <- data_2000 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2000 <- data_2000_grouped %>%
arrange(desc(TotalPop))
head(Top5_2000)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 285440
2 Tampines 253320
3 Jurong West 204900
4 Hougang 204520
5 Woodlands 188870
6 Ang Mo Kio 181100
Top 5 populated areas of 2000 were “Bedok”, “Tampines”, “Jurong West”, “Hougang”, and “Woodlands”.
data_2020 <- dfcombined_selected %>%
filter(Time == "2020")
data_2020_grouped <- data_2020 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2020 <- data_2020_grouped %>%
arrange(desc(TotalPop))
head(Top5_2020)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 277720
2 Jurong West 263050
3 Tampines 260380
4 Woodlands 255350
5 Sengkang 249670
6 Hougang 228130
Top 5 populated areas of 2020 were “Bedok”, “Jurong West”, “Tampines”, “Woodlands” and “Sengkang”.
dfgrouped_selected1 <- dfgrouped_selected %>%
filter(PA == "Bedok" | PA == "Jurong West" | PA == "Tampines" |
PA == "Woodlands" | PA =="Sengkang" | PA == "Hougang")
p1 <- dfgrouped_selected1 %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p1, height = 1500, width = 800, tooltip = "text")
For the plot above, we have below observations:
Out of total 57 planning areas, there have been 14 areas with no population since 2000. They are “Boon Lay”,“Boon Lay/Pioneer”, “Central Water Catchment”, “Changi Bay”, “Marina East”, “Marina South”, “Paya Lebar”, “Pioneer”, “Simpang”, “Straits View”, “Tengah”, “Tuas”, “Western Islands”, and “North East Island”.
There were no population in “Southern Islands” and “Punggol” in 2000. In 2020, Punggol was dominated by the middle-age and children. Few thousands of people were living in the Southern Islands in 2020.
“Bedok”, “Hougang”, “Jurong West”, and “Woodlands” demographic structure changes are similar to that of the whole Singapore.
“Sengkang” experienced a net growth of population from 2000-2020. There are less aging people in Sengkang compared to the other 5 planning areas. The dominating public there are the middle-age.
There are various method to define young and aged adult. For the purpose of this study, we define young adults to be between 20 and 39, based on Erikson’s research and define aged adults to be older than 65 (Wikipedia).
Mutate() is used to add new variables and preserve existing ones.
Select() is used to keep the desired columns.
group_by() is used to group data.
pivot_wider() is used to unwrap the age group.
dfcombined_selected1 <- dfcombined_selected %>%
select(- Pop1)
df3<- dfcombined_selected1 %>%
mutate(Group = case_when(AG == "20_to_24" |AG == "25_to_29"|AG == "30_to_34"|AG == "35_to_39" ~ "Young",
AG == "65_to_69" |AG == "70_to_74"|AG == "75_to_79"|AG == "80_to_84"|
AG == "85_to_89"|AG == "90_and_over" ~ "Old",
TRUE ~ "Others")) %>%
select(- AG, -Sex)%>%
group_by(Time, PA, Group) %>%
summarise(GroupPop = sum(Pop))%>%
ungroup() %>%
pivot_wider(names_from = Group, values_from = GroupPop) %>%
mutate (Total = Old+Others+Young) %>%
mutate (Pct_y = Young/Total*100) %>%
mutate (Pct_o = Old/Total*100)
df3
# A tibble: 1,155 x 8
Time PA Old Others Young Total Pct_y Pct_o
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 Ang Mo Kio 14430 107010 59660 181100 32.9 7.97
2 2000 Bedok 22960 173840 88640 285440 31.1 8.04
3 2000 Bishan 6170 56150 27960 90280 31.0 6.83
4 2000 Boon Lay/Pioneer 0 0 0 0 NaN NaN
5 2000 Bukit Batok 6140 77400 43140 126680 34.1 4.85
6 2000 Bukit Merah 18720 82820 47330 148870 31.8 12.6
7 2000 Bukit Panjang 4850 58290 33620 96760 34.7 5.01
8 2000 Bukit Timah 5250 40640 18700 64590 29.0 8.13
9 2000 Central Water Catchme~ 0 0 0 0 NaN NaN
10 2000 Changi 30 640 380 1050 36.2 2.86
# ... with 1,145 more rows
Next, we used the prepared data frame df3 to plot the animated bubble plot to show the evolution of % Young versus % Aged population.
ggplot(df3, aes(x = Pct_o, y = Pct_y,
size = Total,
colour = PA)) +
geom_point(alpha = 0.7,
show.legend = FALSE) +
scale_size(range = c(2, 15)) +
theme_bw() +
labs(title = 'Singapore Young and Aged Adults Share\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/',
x = '% Aged',
y = '% Young') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
As we can see from above chart, % Aged increased and % Young decreased from 2000 to 2020. The decrease of % Young from 2010 onwards is more obvious, compared to that from 2000 to 2009.
Singapore is an aging population for both males and females. There are more female elderly than male elderly. Mature residential areas are representative of overall Singapore demographics changes over time. Recently developed planning areas, for example Punggol and Sengkang, have more young and middle-aged population than the matured planning areas.
Population pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia).
In this take-home exercise, I am going to design an age-sex pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
For this task, the data sets entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 are used. These data sets are available at Department of Statistics home page. The data set is available at Department of Statistics home page.
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages = c('ggiraph', 'plotly',
'DT', 'patchwork',
'gganimate', 'tidyverse',
'readxl', 'gifski', 'gapminder')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The code chunk below imports respopagesextod2000to2010.csv and respopagesextod2011to2020.csv into R environment by using read_csv() function of readr package.
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio.
df1 <- read_csv("data/respopagesextod2000to2010.csv")
df2 <- read_csv("data/respopagesextod2011to2020.csv")
tbl_df(df1)
# A tibble: 1,040,592 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Cheng San 0_to_4 Males HDB 1- and 2-Room ~ 20 2000
2 Ang Mo Kio Cheng San 0_to_4 Males HDB 3-Room Flats 480 2000
3 Ang Mo Kio Cheng San 0_to_4 Males HDB 4-Room Flats 220 2000
4 Ang Mo Kio Cheng San 0_to_4 Males HDB 5-Room and Exe~ 80 2000
5 Ang Mo Kio Cheng San 0_to_4 Males HUDC Flats (exclud~ 0 2000
6 Ang Mo Kio Cheng San 0_to_4 Males Landed Properties 0 2000
7 Ang Mo Kio Cheng San 0_to_4 Males Condominiums and O~ 0 2000
8 Ang Mo Kio Cheng San 0_to_4 Males Others 0 2000
9 Ang Mo Kio Cheng San 0_to_4 Females HDB 1- and 2-Room ~ 20 2000
10 Ang Mo Kio Cheng San 0_to_4 Females HDB 3-Room Flats 390 2000
# ... with 1,040,582 more rows
tbl_df(df2)
# A tibble: 984,656 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1~ 0 2011
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3~ 10 2011
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4~ 30 2011
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5~ 50 2011
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC ~ 0 2011
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Lande~ 0 2011
7 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Condo~ 40 2011
8 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Others 0 2011
9 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 1~ 0 2011
10 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 3~ 10 2011
# ... with 984,646 more rows
As we can see from above, the two data frames have the same column names. We will firstly combine the two data frames into one by rows using rbind.
PA SZ AG
Length:2025248 Length:2025248 Length:2025248
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Sex TOD Pop
Length:2025248 Length:2025248 Min. : 0.00
Class :character Class :character 1st Qu.: 0.00
Mode :character Mode :character Median : 0.00
Mean : 38.41
3rd Qu.: 10.00
Max. :3160.00
Time
Min. :2000
1st Qu.:2005
Median :2010
Mean :2010
3rd Qu.:2015
Max. :2020
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
dfcombined <- dfcombined %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
dfcombined <- dfcombined %>%
mutate(Pop1 = case_when(Sex == "Males" ~ -Pop, TRUE ~ Pop))
dfcombined_selected <- dfcombined %>%
select(- SZ, - TOD)
dfgrouped_selected <- dfcombined_selected %>%
group_by(PA, Time, AG, Sex) %>%
summarise(Pop1=sum(Pop1), .groups = 'keep') %>%
ungroup()
A static plot was firstly created using ggplot2 bar chart. There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
Next, transition_time() method of gganimate extension is added on top of the static plot to determine the behavior of the animation. The transition should happen across time. Linear easing is chosen to define the pace of change.
The full code chunk and output are shown below.
ggplot(data=dfgrouped_selected,
aes(x = AG, y = Pop1, fill = Sex)) +
geom_col() +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
labs(title = 'Demographic Structure of Singapore 2000-2020\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
From above animation, we can see that Singapore population have been aging from 2000 to 2020 for both males and females as the Pyramid center moves higher with time passing by. There are more female elderly than male elderly.
In this section, we are going to zoom into more details to see how the demographic structure varies by planning area.
p <- dfgrouped_selected %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p, height = 10000, width = 800, tooltip = "text")
Given the size of number of planning areas, let’s focus our analysis on the top5 populated areas of 2000 and 2020.
data_2000 <- dfcombined_selected %>%
filter(Time == "2000")
data_2000_grouped <- data_2000 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2000 <- data_2000_grouped %>%
arrange(desc(TotalPop))
head(Top5_2000)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 285440
2 Tampines 253320
3 Jurong West 204900
4 Hougang 204520
5 Woodlands 188870
6 Ang Mo Kio 181100
Top 5 populated areas of 2000 were “Bedok”, “Tampines”, “Jurong West”, “Hougang”, and “Woodlands”.
data_2020 <- dfcombined_selected %>%
filter(Time == "2020")
data_2020_grouped <- data_2020 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2020 <- data_2020_grouped %>%
arrange(desc(TotalPop))
head(Top5_2020)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 277720
2 Jurong West 263050
3 Tampines 260380
4 Woodlands 255350
5 Sengkang 249670
6 Hougang 228130
Top 5 populated areas of 2020 were “Bedok”, “Jurong West”, “Tampines”, “Woodlands” and “Sengkang”.
dfgrouped_selected1 <- dfgrouped_selected %>%
filter(PA == "Bedok" | PA == "Jurong West" | PA == "Tampines" |
PA == "Woodlands" | PA =="Sengkang" | PA == "Hougang")
p1 <- dfgrouped_selected1 %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p1, height = 1500, width = 800, tooltip = "text")
For the plot above, we have below observations:
Out of total 57 planning areas, there have been 14 areas with no population since 2000. They are “Boon Lay”,“Boon Lay/Pioneer”, “Central Water Catchment”, “Changi Bay”, “Marina East”, “Marina South”, “Paya Lebar”, “Pioneer”, “Simpang”, “Straits View”, “Tengah”, “Tuas”, “Western Islands”, and “North East Island”.
There were no population in “Southern Islands” and “Punggol” in 2000. In 2020, Punggol was dominated by the middle-age and children. Few thousands of people were living in the Southern Islands in 2020.
“Bedok”, “Hougang”, “Jurong West”, and “Woodlands” demographic structure changes are similar to that of the whole Singapore.
“Sengkang” experienced a net growth of population from 2000-2020. There are less aging people in Sengkang compared to the other 5 planning areas. The dominating public there are the middle-age.
There are various method to define young and aged adult. For the purpose of this study, we define young adults to be between 20 and 39, based on Erikson’s research and define aged adults to be older than 65 (Wikipedia).
Mutate() is used to add new variables and preserve existing ones.
Select() is used to keep the desired columns.
group_by() is used to group data.
pivot_wider() is used to unwrap the age group.
dfcombined_selected1 <- dfcombined_selected %>%
select(- Pop1)
df3<- dfcombined_selected1 %>%
mutate(Group = case_when(AG == "20_to_24" |AG == "25_to_29"|AG == "30_to_34"|AG == "35_to_39" ~ "Young",
AG == "65_to_69" |AG == "70_to_74"|AG == "75_to_79"|AG == "80_to_84"|
AG == "85_to_89"|AG == "90_and_over" ~ "Old",
TRUE ~ "Others")) %>%
select(- AG, -Sex)%>%
group_by(Time, PA, Group) %>%
summarise(GroupPop = sum(Pop))%>%
ungroup() %>%
pivot_wider(names_from = Group, values_from = GroupPop) %>%
mutate (Total = Old+Others+Young) %>%
mutate (Pct_y = Young/Total*100) %>%
mutate (Pct_o = Old/Total*100)
df3
# A tibble: 1,155 x 8
Time PA Old Others Young Total Pct_y Pct_o
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 Ang Mo Kio 14430 107010 59660 181100 32.9 7.97
2 2000 Bedok 22960 173840 88640 285440 31.1 8.04
3 2000 Bishan 6170 56150 27960 90280 31.0 6.83
4 2000 Boon Lay/Pioneer 0 0 0 0 NaN NaN
5 2000 Bukit Batok 6140 77400 43140 126680 34.1 4.85
6 2000 Bukit Merah 18720 82820 47330 148870 31.8 12.6
7 2000 Bukit Panjang 4850 58290 33620 96760 34.7 5.01
8 2000 Bukit Timah 5250 40640 18700 64590 29.0 8.13
9 2000 Central Water Catchme~ 0 0 0 0 NaN NaN
10 2000 Changi 30 640 380 1050 36.2 2.86
# ... with 1,145 more rows
Next, we used the prepared data frame df3 to plot the animated bubble plot to show the evolution of % Young versus % Aged population.
ggplot(df3, aes(x = Pct_o, y = Pct_y,
size = Total,
colour = PA)) +
geom_point(alpha = 0.7,
show.legend = FALSE) +
scale_size(range = c(2, 15)) +
theme_bw() +
labs(title = 'Singapore Young and Aged Adults Share\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/',
x = '% Aged',
y = '% Young') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
As we can see from above chart, % Aged increased and % Young decreased from 2000 to 2020. The decrease of % Young from 2010 onwards is more obvious, compared to that from 2000 to 2009.
Singapore is an aging population for both males and females. There are more female elderly than male elderly. Mature residential areas are representative of overall Singapore demographics changes over time. Recently developed planning areas, for example Punggol and Sengkang, have more young and middle-aged population than the matured planning areas.
Population pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia).
In this take-home exercise, I am going to design an age-sex pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
For this task, the data sets entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 are used. These data sets are available at Department of Statistics home page. The data set is available at Department of Statistics home page.
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages = c('ggiraph', 'plotly',
'DT', 'patchwork',
'gganimate', 'tidyverse',
'readxl', 'gifski', 'gapminder')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The code chunk below imports respopagesextod2000to2010.csv and respopagesextod2011to2020.csv into R environment by using read_csv() function of readr package.
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio.
df1 <- read_csv("data/respopagesextod2000to2010.csv")
df2 <- read_csv("data/respopagesextod2011to2020.csv")
tbl_df(df1)
# A tibble: 1,040,592 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Cheng San 0_to_4 Males HDB 1- and 2-Room ~ 20 2000
2 Ang Mo Kio Cheng San 0_to_4 Males HDB 3-Room Flats 480 2000
3 Ang Mo Kio Cheng San 0_to_4 Males HDB 4-Room Flats 220 2000
4 Ang Mo Kio Cheng San 0_to_4 Males HDB 5-Room and Exe~ 80 2000
5 Ang Mo Kio Cheng San 0_to_4 Males HUDC Flats (exclud~ 0 2000
6 Ang Mo Kio Cheng San 0_to_4 Males Landed Properties 0 2000
7 Ang Mo Kio Cheng San 0_to_4 Males Condominiums and O~ 0 2000
8 Ang Mo Kio Cheng San 0_to_4 Males Others 0 2000
9 Ang Mo Kio Cheng San 0_to_4 Females HDB 1- and 2-Room ~ 20 2000
10 Ang Mo Kio Cheng San 0_to_4 Females HDB 3-Room Flats 390 2000
# ... with 1,040,582 more rows
tbl_df(df2)
# A tibble: 984,656 x 7
PA SZ AG Sex TOD Pop Time
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1~ 0 2011
2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3~ 10 2011
3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4~ 30 2011
4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5~ 50 2011
5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC ~ 0 2011
6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Lande~ 0 2011
7 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Condo~ 40 2011
8 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Others 0 2011
9 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 1~ 0 2011
10 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 3~ 10 2011
# ... with 984,646 more rows
As we can see from above, the two data frames have the same column names. We will firstly combine the two data frames into one by rows using rbind.
PA SZ AG
Length:2025248 Length:2025248 Length:2025248
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Sex TOD Pop
Length:2025248 Length:2025248 Min. : 0.00
Class :character Class :character 1st Qu.: 0.00
Mode :character Mode :character Median : 0.00
Mean : 38.41
3rd Qu.: 10.00
Max. :3160.00
Time
Min. :2000
1st Qu.:2005
Median :2010
Mean :2010
3rd Qu.:2015
Max. :2020
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
dfcombined <- dfcombined %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
dfcombined <- dfcombined %>%
mutate(Pop1 = case_when(Sex == "Males" ~ -Pop, TRUE ~ Pop))
dfcombined_selected <- dfcombined %>%
select(- SZ, - TOD)
dfgrouped_selected <- dfcombined_selected %>%
group_by(PA, Time, AG, Sex) %>%
summarise(Pop1=sum(Pop1), .groups = 'keep') %>%
ungroup()
A static plot was firstly created using ggplot2 bar chart. There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
Next, transition_time() method of gganimate extension is added on top of the static plot to determine the behavior of the animation. The transition should happen across time. Linear easing is chosen to define the pace of change.
The full code chunk and output are shown below.
ggplot(data=dfgrouped_selected,
aes(x = AG, y = Pop1, fill = Sex)) +
geom_col() +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
labs(title = 'Demographic Structure of Singapore 2000-2020\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
From above animation, we can see that Singapore population have been aging from 2000 to 2020 for both males and females as the Pyramid center moves higher with time passing by. There are more female elderly than male elderly.
In this section, we are going to zoom into more details to see how the demographic structure varies by planning area.
p <- dfgrouped_selected %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p, height = 10000, width = 800, tooltip = "text")
Given the size of number of planning areas, let’s focus our analysis on the top5 populated areas of 2000 and 2020.
data_2000 <- dfcombined_selected %>%
filter(Time == "2000")
data_2000_grouped <- data_2000 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2000 <- data_2000_grouped %>%
arrange(desc(TotalPop))
head(Top5_2000)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 285440
2 Tampines 253320
3 Jurong West 204900
4 Hougang 204520
5 Woodlands 188870
6 Ang Mo Kio 181100
Top 5 populated areas of 2000 were “Bedok”, “Tampines”, “Jurong West”, “Hougang”, and “Woodlands”.
data_2020 <- dfcombined_selected %>%
filter(Time == "2020")
data_2020_grouped <- data_2020 %>%
group_by(PA) %>%
summarise(`TotalPop` = sum(Pop)) %>%
ungroup()
Top5_2020 <- data_2020_grouped %>%
arrange(desc(TotalPop))
head(Top5_2020)
# A tibble: 6 x 2
PA TotalPop
<chr> <dbl>
1 Bedok 277720
2 Jurong West 263050
3 Tampines 260380
4 Woodlands 255350
5 Sengkang 249670
6 Hougang 228130
Top 5 populated areas of 2020 were “Bedok”, “Jurong West”, “Tampines”, “Woodlands” and “Sengkang”.
dfgrouped_selected1 <- dfgrouped_selected %>%
filter(PA == "Bedok" | PA == "Jurong West" | PA == "Tampines" |
PA == "Woodlands" | PA =="Sengkang" | PA == "Hougang")
p1 <- dfgrouped_selected1 %>%
ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1),
"\nSex: ", Sex, "\nYear: ", Time))) +
xlab("Age") +
coord_flip() +
scale_y_continuous(name = "Population",
breaks = c(seq(-16000, 16000, 2000)),
labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
facet_wrap(~PA, ncol=2,scale = "free_x") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))
ggplotly(p1, height = 1500, width = 800, tooltip = "text")
For the plot above, we have below observations:
Out of total 57 planning areas, there have been 14 areas with no population since 2000. They are “Boon Lay”,“Boon Lay/Pioneer”, “Central Water Catchment”, “Changi Bay”, “Marina East”, “Marina South”, “Paya Lebar”, “Pioneer”, “Simpang”, “Straits View”, “Tengah”, “Tuas”, “Western Islands”, and “North East Island”.
There were no population in “Southern Islands” and “Punggol” in 2000. In 2020, Punggol was dominated by the middle-age and children. Few thousands of people were living in the Southern Islands in 2020.
“Bedok”, “Hougang”, “Jurong West”, and “Woodlands” demographic structure changes are similar to that of the whole Singapore.
“Sengkang” experienced a net growth of population from 2000-2020. There are less aging people in Sengkang compared to the other 5 planning areas. The dominating public there are the middle-age.
There are various method to define young and aged adult. For the purpose of this study, we define young adults to be between 20 and 39, based on Erikson’s research and define aged adults to be older than 65 (Wikipedia).
Mutate() is used to add new variables and preserve existing ones.
Select() is used to keep the desired columns.
group_by() is used to group data.
pivot_wider() is used to unwrap the age group.
dfcombined_selected1 <- dfcombined_selected %>%
select(- Pop1)
df3<- dfcombined_selected1 %>%
mutate(Group = case_when(AG == "20_to_24" |AG == "25_to_29"|AG == "30_to_34"|AG == "35_to_39" ~ "Young",
AG == "65_to_69" |AG == "70_to_74"|AG == "75_to_79"|AG == "80_to_84"|
AG == "85_to_89"|AG == "90_and_over" ~ "Old",
TRUE ~ "Others")) %>%
select(- AG, -Sex)%>%
group_by(Time, PA, Group) %>%
summarise(GroupPop = sum(Pop))%>%
ungroup() %>%
pivot_wider(names_from = Group, values_from = GroupPop) %>%
mutate (Total = Old+Others+Young) %>%
mutate (Pct_y = Young/Total*100) %>%
mutate (Pct_o = Old/Total*100)
df3
# A tibble: 1,155 x 8
Time PA Old Others Young Total Pct_y Pct_o
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 Ang Mo Kio 14430 107010 59660 181100 32.9 7.97
2 2000 Bedok 22960 173840 88640 285440 31.1 8.04
3 2000 Bishan 6170 56150 27960 90280 31.0 6.83
4 2000 Boon Lay/Pioneer 0 0 0 0 NaN NaN
5 2000 Bukit Batok 6140 77400 43140 126680 34.1 4.85
6 2000 Bukit Merah 18720 82820 47330 148870 31.8 12.6
7 2000 Bukit Panjang 4850 58290 33620 96760 34.7 5.01
8 2000 Bukit Timah 5250 40640 18700 64590 29.0 8.13
9 2000 Central Water Catchme~ 0 0 0 0 NaN NaN
10 2000 Changi 30 640 380 1050 36.2 2.86
# ... with 1,145 more rows
Next, we used the prepared data frame df3 to plot the animated bubble plot to show the evolution of % Young versus % Aged population.
ggplot(df3, aes(x = Pct_o, y = Pct_y,
size = Total,
colour = PA)) +
geom_point(alpha = 0.7,
show.legend = FALSE) +
scale_size(range = c(2, 15)) +
theme_bw() +
labs(title = 'Singapore Young and Aged Adults Share\n\n {as.integer(frame_time)}',
caption = '\n\n Data Source: https://www.singstat.gov.sg/',
x = '% Aged',
y = '% Young') +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
transition_time(Time) +
ease_aes('linear')
As we can see from above chart, % Aged increased and % Young decreased from 2000 to 2020. The decrease of % Young from 2010 onwards is more obvious, compared to that from 2000 to 2009.
Singapore is an aging population for both males and females. There are more female elderly than male elderly. Mature residential areas are representative of overall Singapore demographics changes over time. Recently developed planning areas, for example Punggol and Sengkang, have more young and middle-aged population than the matured planning areas.